
Adversarial Attack & Defense Visualization

Explore how adversarial attacks work and how to defend against them

Introduction
Interactive Demo
Defense Comparison
Learn More

Understanding Adversarial Attacks and Defenses

This interactive tool demonstrates how adversarial attacks can fool AI systems and how defensive techniques can make models more robust against these attacks.

What are Adversarial Attacks?

Adversarial attacks are specially crafted perturbations added to input data that cause machine learning models to make incorrect predictions. These perturbations are often imperceptible to humans but can completely change a model's output.

[Figure: Original Image (classified as 7) + Perturbation (amplified for visibility) = Adversarial Example (misclassified as 2)]

Fast Gradient Sign Method (FGSM)

In this demo, we implement the Fast Gradient Sign Method (FGSM), a common adversarial attack. FGSM works by:

  1. Taking a correctly classified input image
  2. Computing the gradient of the loss with respect to the input
  3. Creating a perturbation by taking the sign of this gradient
  4. Adding this perturbation (scaled by epsilon) to the original image

The result is an "adversarial example" that looks almost identical to the original image to humans, but is misclassified by the model.
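The whole attack fits in a few lines. Below is a minimal sketch in PyTorch (the framework choice and the [0, 1] pixel range are assumptions; the demo's own implementation may differ), with the four steps above marked in comments:

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, image, label, epsilon):
        """Create an FGSM adversarial example.

        model   : differentiable classifier returning logits (assumed)
        image   : batch of input images with pixel values in [0, 1]
        label   : batch of ground-truth class indices
        epsilon : perturbation strength
        """
        image = image.clone().detach().requires_grad_(True)

        # Steps 1-2: forward pass, then gradient of the loss w.r.t. the input
        loss = F.cross_entropy(model(image), label)
        loss.backward()

        # Step 3: the perturbation is the sign of the input gradient
        perturbation = epsilon * image.grad.sign()

        # Step 4: add the perturbation and clamp back to the valid pixel range
        adv_image = torch.clamp(image + perturbation, 0.0, 1.0)
        return adv_image.detach(), perturbation.detach()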

Defense Strategies

We'll explore several approaches to defend against adversarial attacks:

Adversarial Training

Training models on adversarial examples so they learn to resist attacks. Like immunization, exposing the model to attacks during training makes it more robust.
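A hedged sketch of a single adversarial-training step, reusing the fgsm_attack helper above; the 50/50 mixing ratio and the epsilon value are illustrative choices, not necessarily how the demo's robust model was trained:

    def adversarial_training_step(model, optimizer, images, labels, epsilon=0.1):
        """One optimization step on a mix of clean and FGSM adversarial examples."""
        model.train()

        # Craft adversarial versions of the current batch against the current model
        adv_images, _ = fgsm_attack(model, images, labels, epsilon)

        optimizer.zero_grad()
        # Train on both clean and adversarial inputs (an even mix is one common choice)
        loss = 0.5 * F.cross_entropy(model(images), labels) \
             + 0.5 * F.cross_entropy(model(adv_images), labels)
        loss.backward()
        optimizer.step()
        return loss.item()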

Input Preprocessing

Applying transformations to input images (like Gaussian noise) that disrupt adversarial perturbations while preserving key features for classification.
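A minimal sketch of this idea, assuming Gaussian noise with an illustrative scale sigma (the noise parameters the demo actually uses are not specified here):

    def noisy_predict(model, image, sigma=0.1):
        """Add Gaussian noise before classifying, to disrupt crafted perturbations."""
        noisy = torch.clamp(image + sigma * torch.randn_like(image), 0.0, 1.0)
        with torch.no_grad():
            return model(noisy).argmax(dim=1)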

Ensemble Defense

Combining predictions from multiple models, making attacks harder because they need to fool all models simultaneously.
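A sketch of the ensemble idea, averaging softmax outputs over several independently trained models (how the demo's ensemble members are trained is not shown here):

    def ensemble_predict(models, image):
        """Average the softmax outputs of several models, then take the argmax."""
        with torch.no_grad():
            probs = torch.stack([F.softmax(m(image), dim=1) for m in models])
        return probs.mean(dim=0).argmax(dim=1)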

Get Started: Click on the "Interactive Demo" tab to create adversarial examples and test defense strategies!

Interactive Adversarial Attack Demo

In this interactive demo, you can generate adversarial examples using the FGSM attack and see how different defenses perform against them.

Step 1: Select an Image


Step 2: Generate an Adversarial Example

Use the epsilon slider (default 0.1) to set the attack strength; the demo then shows the model's prediction on the adversarial image alongside the perturbation, amplified ×5 for visibility.
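Under the hood, these two steps correspond to something like the following usage of the fgsm_attack sketch from the introduction (the torchvision MNIST loader, the unnormalized [0, 1] inputs, and the name model for the demo's MNIST classifier are all assumptions):

    from torchvision import datasets, transforms

    # Step 1: select an image from the MNIST test set
    test_set = datasets.MNIST(root="data", train=False, download=True,
                              transform=transforms.ToTensor())
    image, label = test_set[0]
    image = image.unsqueeze(0)          # add a batch dimension: (1, 1, 28, 28)
    label = torch.tensor([label])

    # Step 2: generate the adversarial example at the chosen epsilon
    adv_image, perturbation = fgsm_attack(model, image, label, epsilon=0.1)
    print("clean prediction:      ", model(image).argmax(dim=1).item())
    print("adversarial prediction:", model(adv_image).argmax(dim=1).item())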

Defense Comparison

This section compares the effectiveness of different defense strategies against FGSM attacks of varying strengths.

Standard Model (Low Robustness)

The standard model has no defenses against adversarial attacks. It performs well on clean data but is highly vulnerable to adversarial examples.

Vulnerable at ε=0.1: 87%
Clean accuracy: 98%

Adversarially Trained Model (High Robustness)

This model is trained on adversarial examples, teaching it to resist attacks. Like a vaccine, exposure to attacks during training improves immunity.

Vulnerable at ε=0.1: 39%
Clean accuracy: 97%

Input Preprocessing Defense (Medium Robustness)

This defense adds random noise to inputs, which disrupts the carefully crafted adversarial perturbations while preserving key features.

Vulnerable at ε=0.1: 65%
Clean accuracy: 97%

Ensemble Defense (High Robustness)

The ensemble combines predictions from multiple models, making attacks harder since they must fool all models simultaneously.

Vulnerable at ε=0.1: 45%
Clean accuracy: 98%

Learn More: Adversarial Attacks & Defenses

Dive deeper into the concepts and techniques of adversarial machine learning.

Types of Adversarial Attacks

While this demo focuses on the FGSM attack, there are many other types of adversarial attacks:

  • Projected Gradient Descent (PGD): A more powerful, iterative version of FGSM (see the sketch after this list)
  • Carlini & Wagner (C&W) Attack: An optimization-based attack that produces very effective adversarial examples
  • DeepFool: Finds the minimal perturbation needed to cross the decision boundary
  • Jacobian-based Saliency Map Attack (JSMA): Modifies only the most influential pixels
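For comparison with the FGSM sketch above, PGD repeats the same signed-gradient step and projects the result back into an ε-ball around the original image. The step size alpha and iteration count below are illustrative defaults, not the settings of any particular paper or of this tool:

    def pgd_attack(model, image, label, epsilon, alpha=0.01, steps=10):
        """Iterative FGSM with projection onto the L-infinity epsilon-ball."""
        original = image.clone().detach()
        adv = original.clone()

        for _ in range(steps):
            adv.requires_grad_(True)
            loss = F.cross_entropy(model(adv), label)
            loss.backward()

            with torch.no_grad():
                adv = adv + alpha * adv.grad.sign()                              # gradient-sign step
                adv = torch.clamp(adv, original - epsilon, original + epsilon)   # project into the ball
                adv = torch.clamp(adv, 0.0, 1.0)                                 # keep valid pixel range
        return adv.detach()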

More Defense Strategies

Beyond the defenses demonstrated in this tool, researchers have developed several other approaches:

  • Defensive Distillation: Training a model to match the output of another model, making gradients harder to exploit
  • Randomized Smoothing: Adding random noise to inputs and averaging predictions to create certifiably robust classifiers
  • Feature Squeezing: Reducing the precision of inputs (for example, their bit depth) to remove adversarial perturbations (a short sketch follows this list)
  • Gradient Masking/Obfuscation: Hiding gradients to make gradient-based attacks harder (though this can be bypassed)
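As an illustration of one of these, feature squeezing via bit-depth reduction is only a couple of lines; the bit depth here is an illustrative choice:

    def feature_squeeze(image, bits=4):
        """Quantize [0, 1] pixel values to 2**bits levels to squeeze out
        small adversarial perturbations."""
        levels = 2 ** bits - 1
        return torch.round(image * levels) / levels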

Real-World Implications

Adversarial attacks have significant implications for AI security in real-world applications:

  • Autonomous Vehicles: Attackers could potentially place adversarial stickers on road signs to cause misclassification
  • Facial Recognition: Specially designed patterns on glasses or clothing could fool identity verification systems
  • Malware Detection: Adversarial techniques could help malware evade machine learning-based detection
  • Medical Diagnostics: Adversarial perturbations could cause misdiagnosis in AI-assisted medical imaging systems

Further Reading

  • "Intriguing properties of neural networks" - Szegedy et al. (2013) - First paper to identify the adversarial example phenomenon
  • "Explaining and Harnessing Adversarial Examples" - Goodfellow et al. (2014) - Introduced the FGSM attack
  • "Towards Deep Learning Models Resistant to Adversarial Attacks" - Madry et al. (2017) - Introduced PGD attacks and adversarial training
  • "Towards Evaluating the Robustness of Neural Networks" - Carlini & Wagner (2017) - Introduced the C&W attack
  • "Certified Robustness to Adversarial Examples with Differential Privacy" - Lecuyer et al. (2019) - Connection between robustness and privacy

Help & Information

What are adversarial attacks?

Adversarial attacks are inputs specifically designed to cause machine learning models to make mistakes. They work by adding carefully crafted perturbations that are often imperceptible to humans.

How does FGSM work?

The Fast Gradient Sign Method (FGSM) computes the gradient of the loss with respect to the input, then takes the sign of this gradient to create a perturbation. By adding this perturbation (scaled by epsilon) to the original image, it creates an adversarial example.

What is epsilon?

Epsilon controls the strength of the adversarial perturbation. Larger values create more visible perturbations that are more likely to fool the model but are also more noticeable to humans.

What are the defense strategies?

This tool demonstrates three defenses: adversarial training (training the model on adversarial examples), input preprocessing (adding random noise to disrupt crafted perturbations), and ensemble defense (combining the predictions of multiple models).

Can I test my own images?

Yes! You can use the "Draw your own" option to create and test your own handwritten digits.